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Abstract 

In this work we address the problem of solving a series of underdetermined linear inverse 
problems subject to a sparsity constraint. We generalize the spike and slab prior distribu¬ 
tion to encode a priori correlation of the support of the solution in both space and time 
by imposing a transformed Gaussian process on the spike and slab probabilities. An ex¬ 
pectation propagation (EP) algorithm for posterior inference under the proposed model 
is derived. For large scale problems, standard EP algorithm can be prohibitively slow. 
We therefore introduce three different approximation schemes to reduce the computational 
complexity. Finally, we demonstrate the proposed model using numerical experiments 
based on both synthetic and real data sets. 

Keywords: Linear inverse problems, bayesian inference, expectation propagation, sparsity- 
promoting priors, spike and slab priors 


1. Introduction 


Many problems of practical interest in machine learning involve a high dimensional feature 
space and a relatively small number of observations. Inference is in general difficult for such 
underdetermined problems due to high variance and therefore regularization is often the key 
to extracting meaningful information from such problems (Tibshirani, [1994 ). The classical 
approach is Tikhonov regularization (also known as t '2 regularization), but during the last 
few decades sparsity have been an increasingly popular choice of regularization for many 
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problems, giving rise to methods such as the LASSO (Tibshirani, 1994), Sparse Bayesian 
Learning (Tipping 2001) and sparsity promoting priors (Mitchell and Beauchamp 1988). 


In this work we address the problem of hnding sparse solutions to linear inverse problems 
of the form 


y = Ax + e. 


( 1 ) 


where x G is the desired solution, y G is an observed measurement vector, 
A G is a known forward model and e G is additive measurement noise. We 

are mainly interested in the underdetermined regime, where the number of observations is 
smaller than the number of unknowns, i.e. N < D. In the sparse recovery literature it 
has been shown that the sparsity constraint is crucial for recovering x from a small set of 
linear measurements (Candes et ah, 2006). Furthermore, the degree of sparsity of x, i.e. 


K = ||a;||o, dictates the required number of measurements N for robust reconstruction of x. 
This relationship between the number of non-zero coefficients K and the number of mea¬ 


surements N has given rise to so-called phase transition curves (Donoho and Tanner, 2010). 
A large body of research has been dedicated to improve these phase transition curves and 


these endeavors have lead to the concepts of multiple measurement vectors (Cotter et al. 


2005 

) and structured sparsity ( 

Huang et ah 

2009 


The multiple measurement vector problem (MMV) is a natural extension of eq. ([^, where 
multiple measurement vectors yi,y 2 , ■ ■ ■ ,yT are observed and assumed to be generated 
from a series of signals xi,X 2 , ■ ■ ■, xt, which share a common sparsity pattern. In matrix 
notation, we can write the problem as 


Y = AX + E, 


( 2 ) 


where the desired solution is now a matrix X = X 2 ... ®t] £ 

for the measurement matrix Y G and the noise term E G 


and similar 
The assumption 

of joint sparsity allows one to recover X with significantly fewer observations compared 
to solving each of the T inverse problems in eq. ([^ separately (Cotter et ah, 2005). This 


MMV approach has also been generalized to problems where the sparsity pattern is evolving 


slowly in time (Ziniel and Schniter, 2013a). Structured sparsity, on the other hand, is a 


generalization of simple sparsity and seek to exploit the fact that the sparsity pattern of 
many natural signals contains a richer structure than simple sparsity, e.g. group sparsity 


(Jacob et ah, 2009b) or cluster structured sparsity (Yu et ah, 2012) 


In this paper we combine these two approaches and focus on problems, where the sparsity 
pattern of X exhibits a spatio-temporal structure. In particular, we assume that the row 
and column indices of X can be associated with a set of spatial and temporal coordinates, 
respectively. This can equivalently be interpreted as a sparse linear regression problem, 
where the support of the regressors is correlated in both space and time. Applications of 


such a model include dynamic compressed sensing ( 

Ziniel and Schniter 

2013a 

), background 

subtraction in computer vision ( 

Cevher et ah 

2009 

) and EEG source localization problem 


(Baillet et ah, 2001). 
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We take a Bayesian approach to modeling this structure since it provides a natural way 
of incorporating such prior knowledge in a model. In particular, we propose a hierarchi¬ 


cal probabilistic model for X based on the so-called spike and slab prior (Mitchell and 


Beauchamp, 1988). We extend the work in (] Andersen et al. 2014) introducing a smooth 


latent variable controlling the spatiotemporal structure of the support of X. We aim for full 
Bayesian inference under the proposed probabilistic model, but inference w.r.t. the exact 
posterior distribution of interest is intractable. Instead we resort to approximate inference 


using Expectation Propagation (Minka 

2001 

Opper and Winther 2000), which has been 

shown to provide accurate inference for spike and slab 

priors (jHernandez-Lobato et al. 

2013 

Hernandez-Lobato et al., 

2010 Jylanki et al. 

2014 

Peltola et ah, 2014). Our model 


formulation is generic and generalizes easily to other types of observations. In particular, 
we also combine the proposed prior with a probit observation model to model binary ob¬ 
servations in a sparse linear classification setting. 

The contribution of this paper is two-fold. First we extend the structured spike and slab 
prior and the associated EP inference scheme to incorporate both spatial and temporal 
smoothness of the support. However, the computational complexity of the resulting EP 
algorithm is prohibitively slow for problems of even moderate sizes of signal dimension D 
and length T. To alleviate the computational bottleneck of the EP algorithm we propose 
three different approximation schemes and evaluate them based on synthetic and real data 
sets. 

1.1 Related work 

In this section we briefly review some of the most common approaches to simple sparsity and 
their generalization to structured sparsity. The classical approach to sparsity is the LASSO 


(Tibshirani, 1994), which operates by optimizing a least squares cost function augmented 


with a ii penalty on the regression weights. Several extensions have been proposed in the lit¬ 
erature to generalize the LASSO to the structured sparsity setting, examples include group 
and graph LASSO pacob et al. 2009b). Prom a probabilistic perspective sparsity can be 
encouraged through the use of sparsity-promoting priors, i.e. prior distributions which favor 
sparse solutions. A non-exhaustive list of sparsity-promoting priors includes the Laplace 
distribution. Automatic Relevance Determination priors, the horseshoe prior and the spike 
and slab priors. All of these were originally designed to enforce simply sparsity, but they 
have all been generalized to the structured sparsity setting. The general strategy is to ex¬ 
tend univariate densities to correlated multivariate densities by augmenting the models with 
a latent multivariate variable, where the correlation structure can be controlled explicitly, 
e.g. using Markov Random Eields (Cevher et ah, 2009; Hernandez-Lobato et ah, 2011) or 


multivariate Gaussian distributions. Here we limit ourselves to consider the latter. 


Prom the probabilistic perspective optimizing with an ii regularization term can be in¬ 
terpreted as maximum a posteriori (MAP) inference under an i.i.d. Laplace prior distri¬ 
bution on the regression weights ( [Park and CaselTa 2008). The univariate Laplace prior 
has been generalized to a multivariate distribution with coupling between the regression 
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weights through a scale mixture formulation (Gerven et ah, 2009) 


Another approach is Automatic Relevance Determination (ARD) (Neal, 2012), which works 


by imposing independent zero mean Gaussian priors with individual precision parameters 
on the regression weights. These precision parameters are then optimized using a max¬ 
imum likelihood type II and the idea is then that the precision parameters of irrelevant 
features will approach zero and thereby force the weights of the irrelevant features to zero 
as well. Wu et al. extends the ARD framework to promote spatial sparsity by introducing a 
latent multivariate Gaussian distribution to impose spatial structure onto the precision pa¬ 


rameters of ARD giving rise to dependent relevanee determination priors (Wu et al., 2014b). 


The horseshoe prior is defined as a scale mixture of Gaussians, where a half-Gauchy distri¬ 


bution is used as prior for the standard deviation of the Gaussian density (Garvalho et al. 


2009). The resulting density has two very appealing properties for promoting sparsity, 


namely heavy tails and an infinitely large spike at zero. A generalization to the multivari¬ 


ate case can be found in (Hernandez-Lobato and Hernandez-Lobato, 2013). 


The spike and slab prior is an increasingly popular choice of sparsity promoting prior and 
is given by a binary mixture of two components: a Dirac delta distribution (spike) at zero 


and Gaussian distribution (slab) (Mitchell and Beauchamp, 1988). The spike and slab prior 


has been generalized to the group setting in (Hernandez-Lobato et al. 2013), to clustered 


sparsity setting in (Yu et al., 2012) and spatial structures in (Andersen et al., 2014, Nathoo 


et al., 2014). In the (Nathoo et ah, 2014) the spatial structure is induced using basis func¬ 


tions and in (Andersen et al., 2014) the structured is imposed using a multivariate Gaussian 


density. The latter is the starting point of this work. 


1.2 Structure of paper 

This paper is organized as follows. In sectionwe review the structured spike and slab prior 
and in section we discuss different ways of extending the model to include the temporal 
structure as well. After introducing the models we propose an algorithm for approximate 
inference based on the expectation propagation (EP) framework. We review the basics 
of EP and describe the proposed algorithm in section In section we introduce three 
simple approximation schemes to speed of the inference process and discuss their properties. 
Pinally, in section we demonstrate the proposed method using synthetic and real data 
sets. 


1.3 Notation 

We use bold uppercase letters to denote matrices and bold lowercase letters to denote 
vectors. Unless stated otherwise, all vectors are column vectors. Furthermore, we use 
the notation G and a. i G for the n’th row and z’th column in the matrix 

A G respectively. [K] denotes the set of integers from 1 to K, i.e. [K] = {1, 2,.., K}. 

We use the notation a o b to denote the element-wise Hadamard product of a and b and 
A®B ^ j^MATxMTV Kronecker product of matrices A G and B G We 

use N {x\m,V) to denote a multivariate Gaussian density over x with mean vector m and 
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covariance matrix V and ber {z\p) denotes a Bernoulli distribution on z with probability of 
p{z = 1) = p. 


2. The structured spike and slab prior 


The purpose of this section is to describe the structured spike and slab prior (Andersen 


et al., 2014), but first we briefly review the conventional spike and slab prior (Mitchell and 


X G 


nr 


[Beauchamp 1988; Titsias and Lazaro-Gredilla, 2011; Hernandez-Lobato et al. 2013). For 


the spike and slab prior distribution is given by 


D 

p{x\po,pq,tq) = [(1 -pq)5 (xj) + pqM {xi\pQ,To)\ , (3) 

i=l 


where 5 (x) is the Dirac delta function and po, Pq and tq are hyperparameters. In particular, 
Po is the prior probability of a given variable being active, i.e. p{xi / 0 ) = pq, and po, tq are 
the prior mean and variance, respectively, of the active variables. The spike and slab prior 
in eq. (j^ is also known as the Bernoulli-Gaussian prior since the prior can decomposed as 


D 

p{x\pq,po,tq) = 

z i=l 


D 

- Zi )5 (Xj) + ZiN (xjIpo, t-q)] n (^ i \ Po ) ■ 

i=l 


( 4 ) 


Thus, the latent binary variable Zi G {0,1} can interpreted as an indicator variable for the 
event x, 7 ^ 0. We will refer to z as the sparsity pattern or the support of x. In eq. (© and 
(Q we condition explicitly on the hyperparameters po,po,To, but to ease the notation we 
will omit this in the remainder of this paper. 


Due to the product form of the distribution in eq. ([^ and (Q, the variables Xj and xj 
are a priori assumed to be independent for i ^ j- This implies that the number of active 
variables follows a binomial distribution and hence, the marginal probability of x* and Xj 
being jointly active, is given by p{xi ^ 0, Xj 7 ^ 0) = Pq for all i ^ j- However, in many ap¬ 
plications the variables {x^}^^ might a priori have an underlying topographic relationship 
such as a spatial or temporal structure. Without loss of generality we will assume a spatial 
relationship, where di denotes the spatial coordinates of Xj. For such models, it is often a 
reasonable assumption that p{xi 7 ^ 0,Xj 7 ^ 0) should depend on ||dj — dj\\. For instance. 


neighboring voxels in fMRI analysis (Penny et al., 2005) are often more likely to be active 


simultaneously compared to two voxels far apart. Such a priori knowledge is neglected by 
the conventional spike and slab prior in eq. (©• 


The structured spike and slab model is capable of modeling such structure and is given 
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in terms of a hierarchical model 


D 

Pix\z) = n [(1 - Zi ) S ( Xi ) + ZiAf { xi \ po , To )] , 

2=1 

(5) 

D 

p{z\l) = n i^i\^ (^i)) > (/> : M (0,1), 

2=1 

(6) 

ph) =aa(7|po,s:o) , 

(7) 


where 7 is a latent variable controlling the structure of the sparsity pattern. Using this 
model prior knowledge of the structure of the sparsity pattern can be encoded using fiQ and 
XIq. The mean value Hq controls the expected degree of sparsity and the covariance matrix 
X)o determines the prior correlation of the support. The map cj) : M —)• (0,1) serves the 
purpose of squeezing 7 * into the unit interval and thereby (p ( 7 ^) represents the probability 
of Zi = 1. Here we choose (p to be the standard normal cumulative distribution function 
(CDF), but other choices, such as the logistic function, are also possible. 


Using this formulation, the marginal prior probability of the i’th variable being active 
is given by 


p{zi = 1) = Jp{zi = l|7i)p(7i)d7i = J {-Jilpi, So,m) d7i = p f ^ 




+ S, 


( 8 ) 


0,22 


From this expression it is seen that when pi = 0, the prior belief of Zi is unbiased, i.e. 
p{zi = 1) = 0.5, but when pi < 0 the variable Zi is biased toward zero and vice versa. If a 
subset of features {xj\j £ for some subset J C [D] is a priori more likely to explain the 
observed data y, then this information can be encoded in the prior distribution by choosing 
the prior mean of 7 such that pj > pi for all j £ J and for all i ^ J. However, in the 
remainder of this paper we will assume that the prior mean is constant, i.e. pi = uq for 
some i/Q £M. 


The prior probability of two variables, Xi and Xj, being jointly active is 

p{zi = 1, Zj = 1) = j p{ji)p (7j) AA ( 7 I/ 2 , So) d7. (9) 

If So is a diagonal matrix, 7 * and 7 j become independent and we recover the conventional 
spike and slab prior. On the other hand, if we choose So to be a stationary covariance func¬ 
tion of the form So ,7 = g {\\di — dj\\), we see that the joint activation probabilities indeed 
depend on the spatial distance as desired. Finally, we emphasize that this parametrization 
it not limited to nearest neighbors-type structures. In fact, this parametrization supports 
general structures that can be modeled using generic covariance functions. 


3. The spatio-temporal spike and slab prior 

In the following we will extend the structured spike and slab prior distribution to model 
temporal smoothness of the sparsity pattern as well. Let t £ [T] be the time index, then Xt, 
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Zi and are the signal coefficients, the sparsity pattern and the latent structure variable 
at time t. Furthermore, we define the corresponding matrix quantities X. = [a^i ®r]; 

Z = [zi Z 2 zt\ and F = [71 72 ... 7 t] . We consider four different type of temporal 

structure for the sparsity pattern. 


3.1 Time-independent and stationary sparsity patterns 


The simplest model is to treat the T vectors as T i.i.d vectors., i.e. 

T D 

p{X\Z) = nn [{l- Zi^t)5{Xi,t) + Zi^tM {xi^t\pO',TQ)\ , (10) 

t=li=l 

D T 

p{Z,T) = Y\p{zt\jt)'[lpht)- ( 11 ) 

t=l t=l 

This effectively corresponds to solving each of the T regressions problems in eq. 0 inde¬ 
pendently. Another simple approach is to use the so-called joint sparsity assumption and 
assume that the sparsity pattern is stationary or constant for all t G [T], and thus all Xt 
vectors share a common binary support vector 2 , i.e. 

T D 

p{X\z) = nn [(1 - Zi) 5 (Xi,t) -h ZiN {xi^t\pQ, To)] , (12) 

t=li=l 

p{z,l) =p{z\-i)p{-i). (13) 


The sharing of the binary support variable often yields much more robust inference, when 


the assumption of joint sparsity is justified (Cotter et ah, 2005; Zhang and Rao, 2011, Ziniel 


and Schniter 2013b). 


3.2 First order process prior 


The joint sparsity assumption is too strict for some applications. Often a good approxi¬ 
mation can be achieved by assuming the sparsity pattern is piecewise constant, but this 
effectively limits the number of measurement vectors that can be used in the estimation 
process. We now relax this assumption and assume that the support is slowly changing 


in time (Andersen et ah, 2015). To model this, we can impose a first order Gauss-Markov 


process on 7 * of the form 


P (7t|7f-i) =-^( 7*1 (1 - a)po + a7f_i,/?5]o) , (14) 

where the hyperparameters a G [0,1] and /3 > 0 control the temporal correlation and the 
“innovation” of the process, respectively. Furthermore, we assume that the prior distribu¬ 
tion on 7 i is given by 


p(7i) = AA(7 i|/xo,5]o) • (15) 

Under these assumptions the marginal distribution of 72 becomes 

P{l2) = jP (72I71) P(7i)d7i 

= AA(72|/xo, (a^-h/3) Eo) . (16) 
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Therefore, it follows by induction that if a and /3 satisfy + /3 = 1, then the marginal 
distribution of -jt is p(7t) = A/" (7t|/^0) ^o) for all t € [T]. Furthermore, it is seen that when 
a = 0, this model simplifies to the time independent model given in eq. (11). On the other 
hand, when a = 1, this model simplifies to the joint sparsity model in eq. (13). Therefore, 
this model can be seen as a generalization of these two extreme cases. 


3.3 Kronecker product formulation 

The first order model has the advantage that it factorizes over time, which often makes the 
resulting inference problem much easier. On the other hand, first order Markovian dynamics 
is often not sufficient for capturing long range correlations. Imposing a Gaussian process 
distribution on F with arbitrary covariance structure would facilitate modeling of long range 
correlations in both time and space. Therefore, the hierarchical prior distribution for JC 
becomes 


T D 

p{X\Z) = nn [(1 - Zi^t) s (Xi^t) + Zi^tX {xi^t\po,To)] (17) 

i=l i=l 
T 

P{Z\V) = (18) 

i=l 

p(r)=A7(r|po,5]o), (19) 

where the mean and covariance matrix, i.e. po £ and Sq £ are now defined 

for the full F-space. This model is more expressive, but the resulting inference problem be¬ 
comes infeasible for even moderate sizes of D and T. But if we assume equidistant sampling 
in the temporal dimension, the covariance matrix simplifies to a Kronecker product, i.e. 

P(r') = A7 (F|/i.Q, ^j^ejnporal ® Sgpatial) ; (^9) 

where bitemporal £ and Sspatiai £ This decomposition leads to more efficient 

inference schemes as we will discuss in section!^ 


The coefficients {xi^t} are conditionally independent given the support {zi^t}- For some 
applications it could be desirable to impose either spatial smoothness, temporal smooth¬ 


ness or both on the non-zero coefficients themselves (Wu et al., 2014a; Ziniel and Schniter 


2013a), but in this work we only assume a priori knowledge of the structure of the sup¬ 


port. Although temporal smoothness of Xi^t could easily be incorporated into the models 
described above. 


4. Inference using spatiotemporal priors 

In the previous sections we have described the structured spike and slab prior and how 
to extend it to model temporal smoothness as well. We now turn our attention on how 
to perform inference using these models. We focus our discussion on the most general 
formulation using as given in eq. Let Y = \yi y 2 ... yr] be an observation 

matrix, where yt G is an observation vector for time t. We assume that the distribution 














on factors over time and is given by 


T 

p{Y\X) = llp{yt\xt). (21) 

t=i 

We consider two different noise models: an isotropic Gaussian noise model and a probit 
noise model. The Gaussian noise model p{yt\xt) = Af [yt\Axt,a 2 l) is suitable for linear 
inverse problems with forward model A G or equivalently sparse linear regression 

problems with design matrix A G On the other hand, the probit model is suitable 

for modeling binary observations and is given by p{yt\xt) = 0^=1 4> {yt,nO,n,-Xt), where a^,- 
is the n’th row of A. For both models we further assume that the matrix A is constant 
across time. However, this assumption can be easily relaxed to have A depend on t. 

For both noise models the resulting joint distribution becomes 

p{Y,X,Z,T)=p{Y\X)p{X\Z)p{Z\T)p{T) (22) 

T T 

= Ylpivtlxt) [(1 - zt)oS (xt) + ztoAf [xt\0,Tl)] 

t=l t=l 

T 

nBer(2i|</.(7i))W(r|/ro,5]o). (23) 

t=i 

We seek the posterior distribution of the parameters X, Z and F conditioned on the obser¬ 
vations Y, which is obtained by applying Bayes’s Theorem to the joint distribution in eq. 


T T 

p{X,Z,T\Y) = -^Ylpiytlxt)]^ [{I - Zt) o 6 (xt) + Zt o Af {xt\0,Tl)] 

^ t=i t=i 

T 

nBer(;Zi|(/.(7i))W(F|/.Ao,So), (24) 

t=i 

where Z = p{Y) is the marginal likelihood of Y. Due to the product of mixtures in the 
distribution p{X\Z), the expression for the marginal likelihood Z involves a sum over 2^^ 
terms. This renders the computation of the normalization constant Z intractable for even 
small D and T. Hence, the desired posterior distribution is also intractable and we have to 
resort to approximate inference. 


In the literature researchers have applied a whole spectrum of approximate inference meth¬ 
ods for spike and slab priors, e.g. Monte Garlo-methods (Mitchell and Beauchamp, 1988), 
mean-field variational inference 


passing (Vila and Schniter 


Titsias and Lazaro-Gredilla 2011), approximate message 
20131) and expectation propagation (jHernandez-Lobato et al. 


2013; Andersen et ah, 2014). We use the latter since expectation propagation has been 


shown to provide accurate inference for spike and slab models (]Hernandez-Lobato et ak 


2015). 
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4.1 The Expectation Propagation Framework 

In this section we briefly review expectation propagation for completeness. Expectation 
propagation (EP) (Minka, 2001 Opper and Winther, 2000| is a deterministic framework 
for approximating probability distributions. Consider a probability distribution over the 
variables x G {xi, X 2 ,..., x^)} that factorizes into N components 


N 


/(®) = 


(25) 


2 = 1 


where Xi is taken to be a subvector of x. EP takes advantage of this factorization and 
approximates / with a function Q that shares the same factorization 


N 

Q{x )= 

2=1 


Xi 


(26) 


EP approximates each site term fi with a (scaled) distribution fi from the exponential 
family. Since the exponential family is closed under products, the approximation Q will 
also be in the exponential family. Consider the product of all fi terms except the j’th term 


Q\^ix) - Ylfiixi) - 


(27) 


The core of the EP framework is to choose fj such that (x) ^ fj (xj) Q\^ (x). By ap¬ 
proximating fj with fj in the context of , we ensure that the approximation is most accu¬ 
rate in the region of high density according to the cavity distribution . This scheme is im¬ 
plemented by iteratively minimizing the KL divergence KL ^fj (xj) (x)^^fj (xj) Q\^{x)^ . 

Since fj (xj) (x) belongs to the exponential family, the unique solution is obtained by 


matching the expected sufficient statistics (Bishop, 2006). That is, the variational mini¬ 
mization problem 


Q* = argmin KL^/j (xj) Q\^{x)\\q 


(28) 


is a convex problem and the unique solution is found by matching the expected sufficient 
statistics. Once the solution Q* is obtained we update the j’th site approximation as 


fj (xj) oc 


Q* jx) 

Q\^{x) 


(29) 


The steps in eq. (27), (28) and (29) are repeated sequentially for all j G [D] until convergence 
is achieved. 


4.2 The Expectation Propagation Approximation 

The EP framework provides flexibility in the choice of the approximating factors. This 
choice is a trade-off between analytical tractability and sufficient flexibility for capturing 
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the important characteristics of the true density. Consider the desired posterior density of 
interest 


T T 

p {X,Z,T\Y) oc [(1 - Zt)o6 [xt] + ZtoM {xt\Q,Tl)] 

t=i t=i 

V _ V V _ ^ 

h{X) f 2 {x,z) 

T 

nBer(z4|</>(7i))AA(r|/.to,So)_. (30) 

t=i '' ^ 

'-V-^ /4(r) 

h(Z,T) 


This posterior density is decomposed into four terms /* for i = 1, ..,4, where the hrst three 
terms can be further decomposed. The term fi {X) is decomposed into T terms of the form 
fi,t (xt) = p{yt\xt), whereas the terms /2 and /s are further decomposed as follows 


/i (^) = n (®i) = Ylpiyt\xt), 

t=i t=i 

T D T 

f2 (X, Z) = ]^/2,M {Xi^t,Zi,t) = Hn o ^ + Zi,toJ\f {xi^t\p,T)] , 


T D 


t=l i=l t=l i=l 

T D T D 

f3iz,r) = nn nn Ber {zi,t\(p ■ 

t=li=l t=li=l 


(31) 

(32) 

(33) 


Each fi^t term only depend on Xt, f2,i,t only depend on Xi^t and Zi^t and only depend 
on Zi^t and 7*^4. Furthermore, the terms couple the variables Xi^t and Zi^t, while 
couple the variables Zi^t and 'ji^f Based on these observations, we choose fi^t, and 
to have the following forms 

fi,t {xt) = M {xt\mi^t, 1 ^ 1 ,t) , ( 34 ) 

{xi,t, Zi^t) = AA (xi,t|m2,i,t, V2,i,t) Ber (72,1,*)) ( 35 ) 

h,i,t izi,t, 7 i,t) = { 7 i,t\p 3 ,j,t,^ 3 ,i,t) Ber (73,^7)) . ( 36 ) 

The exact term /i is a distribution of y conditioned on x, whereas the approximate term /i 
is a function of x that depends on the data y through mi and Vi etc. Finally, /i already 
belongs to the exponential family and does therefore not have to be approximated by EP. 

That is, h (r) = u (r) =M{T\po,^o). 

Dehne m2,i = [^2,1,1 m2,1,2 • • • "12,1,0] , 1 ^ 2,1 = diag (■U2,t,i V2,t,2 ■ ■ ■ ^2,1,0) and 

72,t = [72,71 72,1,2 • • • 72,1,0] and similarly for ^3,1 and 73,1, then the resulting 
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global approximation becomes 

T T 

Q {X, Z, r) = n AA Hat (xt\m 2 ,t, F 2 ,i) Ber {zt\(P{%t)) 


t=i^ 


't=i^ 


fi,t h,t 

T 

(7t|A3,t, Ber {zt\(t) AA (r|/xo, Sq) 


t=i'^ 


fi,t 

T 


fi 


t=l 


Y\_M Vtj Ber {zt\(l){it)) M (r|/i, S 


t=i 


(37) 


where the parameters of the global approximation are obtained by summing the natural 
parameters. In terms of mean and variance, we get 


Vt = 

rht = Vt 

±. = 




1 -1 




'2,t 


En 1 + S.:r^ 


1 -1 


ft — Sq ^ flQ + Sg ^ fl3 


{%t) = 


4> (72,i,t) 4> 


(!-<(> (!-<(> (73,j,4)) + <P {l2,i,t) 0 ' 


(38) 

(39) 

(40) 

(41) 

(42) 


To compute the global approximation, we need to estimate the parameters rhi^t, Vi,t, 1 ^ 2 ,t, 
^ 2 ,i) A3,t) ^ 3 ,t, 72 ,t and 73^4 for all t G [T] using EP. The estimation process of and 
Vi^t depends on the observation model being used, whereas the estimation procedure of the 
remaining parameters are independent on the choice of observation model. 


In the conventional EP algorithm, the site approximations are updated in a sequential 
manner meaning that the global approximation is updated every time a single site approx¬ 
imation (Minka, 2001) is refined. In this work we use the parallel update scheme to reduce 
the computational complexity of the algorithm. That is, we first update all the site approx¬ 
imations of the form f 2 ,i,t for i G [D], t G [T], and then we update the global approximation 
w.r.t. Xt and similarly for the and the global approximation w.r.t. 7 ^. Prom a message 


passing perspective this can be interpreted as a particular scheduling of messages (Minka 
2005). The proposed algorithm is summarized in Algorithm 


4.3 Estimating parameters for /i t 

The estimation procedure for fi^t depends on the choice of observation model. Here we 
consider two different observation models, namely the isotropic Gaussian and the probit 
models. Both of these models lead to closed form update rules, but this is not true for all 
choices of p{yt\xt). In general if p{yt\xt) factorizes over n and each term only depends on 
Xt through Axt, then the resulting moment integrals are 1-dimensional and can be solved 
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Initialize approximation terms fa for a = 1 , 2 , 3 , 4 and Q 
Repeat until stopping criteria 

— For each fi,n,t {For non-Gaussian likelihoods only): 

* Compute cavity distribution: (x — 

fl.n,t 

* Minimize: | w.r.t. 

* Compute: fi,n,t oc ^ ^ to update parameters ‘rhi^n,t->vi,n,t and 71,n,t. 

— For each 

* Compute cavity distribution: oc — 

f 2 .i,t 

* Minimize: KL(/27,tQ\2 ’^’^|^ gnew 

— ^ 2 , t, new _ 

* Compute: f2,i,t ^ q\ 2 i t update parameters ^1277,^277 and 7277- 

— Update joint approximation parameters: m, V and 7 

— For each 7377: 

* Compute cavity distribution: oc — 

* Minimize: KL(/377Q\3,7t||Q37,new^ ^ Q37,new 

— j^3,t,new _ 

* Compute: to update parameters and 'ya^i^t 

— Update joint approximation parameters: /i, S and 7 
Compute marginal likelihood approximation 


Algorithm 1: Proposed algorithm for approximating the joint posterior distribution over 
X, Z and P conditioned on Y using parallel EP. 


relatively fast using numerical integration procedures (Jylanki et al., 2011) if no closed form 
solutions exists. 


Under the Gaussian noise model, we have 


fi,t {xt) =p{yt\xt) = M {yt\Axt,a^l) . 


(43) 


Thus, fi^t is already in the exponential family for all t € [T] and does therefore not have 
to be approximated using EP. In particular, the parameters for fi^t are determined by the 
relations A and = ^A'^yt. Eor simplicity we also assume that the 

noise variance is constant for all t. 


Under the probit likelihood the term fi^t decompose to fi^t = 0^=1 In this case 

the update of each site approximation fi^t,n resembles the updates for Gaussian process 
classification using EP, see (Rasmussen and Williams, 2006} Hernandez-Lobato et al., 2010| 
for details. 
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4.4 Estimating parameters for / 2 ,t 

The terms f 2 ,t = YiiLi Zi^t) factor over i, which implies that we only need the 

marginal cavity distributions of each pair of Xi^t and Zi^f Consider the update of the 
j’th term at time t, i.e. f 2 ,j,t Zj^t)- The first step is to compute the marginal cavity 
distributions by removing the contribution of f 2 ,j,t ixj,t, zj^t) from the marginal of the global 
approximation Q using eq. ([^ 


(Xj,„ Zj,) = ^ ^ Ber (7'"’^' 

f2,j,t{Xj,t,Zj,t) ^ \ \ 


(44) 


When the approximate distribution belongs to the exponential family, the cavity distribu¬ 
tion is simply obtained by computing the differences in natural parameters. Expressed in 
terms of mean and variance, we get 





= V 


\2j,t 


-^2,it] \ 




%j,t- 


(45) 

(46) 

(47) 


The cavity parameter for 'jj^t in f 2 ,j,t is simply equal to (and vice versa) since ^ 2 ,j,t 
and 73,j,t are the only two terms contributing to the distribution over Zj^f Next, we form 
the tilde distribution f 2 ,j,tQ^'^’^’^ and compute the solution to the KL minimization problem 


in eq. (28) by matching the expected sufficient statistics. This amounts to computing the 


zeroth, first and second moments w.r.t. Xj^t 


Xm = y^^ j x^^.- f 2 ,j,t {xj,t, Zj^t) {xj^t, Zj^t) dxj^t for m = 0,1,2, (48) 

and the first moment of Zj^t 

= E / VI ■ h.,f (V,.. Vi) (V,i. Ht) dv- 


(49) 


^3,t 


For notational convenience we have dropped the dependencies of and Z\ on the indices 
t and j. Alternatively, the moments could be obtained by computing the partial derivatives 
of the log normalizer of the tilde distribution. 


The central moments of Q* in eq.(28) are given by 


pr 1 


T/ ) 1 ^2 Xf 


£IVil = |;. (50) 


Refer to Appendix for analytical expressions for these moments. Once Q* has been 
obtained, we can compute the new update site approximation for /2j,t using eq. (29) as 
follows 

flj,t ixj,t, Zj,t) = Q ^ Ber (z^-*|()) (72^,4)) , (51) 

O' \Xj,t,Zj^t) 
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where the new site parameters, i.e. and are obtained by computing differences 

in natural parameters in the same manner as for the cavity parameters in eq. (45) - (47) 


= 




-1 _ 1 r,\2d4 


-1 


-1 


= V2,j,t 


V [xj^t] ^ E [xj^t] - {V 




-1 


m 




(52) 

(53) 


The new site parameters for zj^t are obtained as (see Appendix [A| for details) 


(b) 


, / - * \ (^) __ _ _ 

" ~ iliffa'.) + Jtei) ” V (olm\^.<, VV.<) + N (ojm\=.< - p„, + T„ 

(54) 

where (a) follows from forming the quotient of the two Bernoulli distributions and (6) 
follows from straightforward algebraic reduction after substituting in the expression for the 
expectation of Zj^f 

4.5 Estimating parameters for /s t 

The procedure for updating = n£i is completely analogously to the procedure 
for f 2 ,t- Consider the update for the j’th term at time t, i.e. After computing the 


cavity distribution in the same manner as in eq. (45)-(47), we now compute the moments 


w.r.t. and zj^t of the (unnormalized) tilde distribution 

Gm = ^ y • h,j,t {zj,t, lj,t) {zj,u lj,t) d7i,t for m = 0,1, 2, 

Ex = J Zj^t • f3J,t {zj,t-, lj,t) lj,t) ^7j,t 


(55) 

(56) 




Given these moments, we can obtain the central moments for Q* in eq. (28) 


Eo 


T/f 

Go G 2 ’ 


E\zj,t] = ^- 


(57) 


Refer to Appendix for analytical expression of the moments. These moments completely 
determine Q* and the j’th site update at the t is computed analogous to f 2 j,t in eq. (51) 


using eq. (52), (53) and (54). 


4.6 The marginal likelihood approximation 

The marginal likelihood of the data, i.e. p{Y), can be useful for model selection, tuning 
hyperparameters etc. and is given by 

= Er) dxdr = Jfi {X) f2 {X, z) dx Jh {z, r) u (r) dr 

z z 

(58) 
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The exact quantity is intractable for this model, but the EP framework also provides an 
approximation of the marginal likelihood. The approximation is obtained by substituting 
the individual exact site terms, e.g. f 2 ,i,t, with a scaled version of the corresponding site 
approximation, e.g. S 2 ,i,tf 2 ,i,t, and then carrying out the marginalization. The scaling 
constants, e.g. S 2 ,i,t are chosen such that 


X: / /2,..< (X.,„ (x„, 2,.,) dx,, - I h...t (x„, 2„) dx,, 

(59) 


Zit 


and similarly for all the site terms fa,i,t for a G [4], i G [D], t £ [T]. As we discuss in the 
experimental section optimizing the hyperparameters using marginal likelihood approxima¬ 
tion can lead to suboptimal results for some problems. However, the marginal likelihood 


approximation can still be useful for monitoring convergence (Jylanki et ah, 2011). 


4.7 The computational details 

In the previous sections, we have described how to use EP for approximate inference using 
the proposed model, and in this section, we discuss some of the computational details of 
the resulting EP algorithm. 


4.7.1 Updating the global govariange matriges 


Given a set of updated site approximations, f 2 ,t = Oj we can compute the parameters 
for the global approximate distribution of Xt using eq. (pSj) and (39). Direct evaluation of 


eq. (38) results in a computational complexity of O {D^)- Recall, that N is assumed to be 
smaller than D. This implies that A has low rank. Eurthermore, the matrix 

V2,i is diagonal, and therefore we can apply the Matrix inversion lemma as follows 


Vt = V2,t - V2,tA^ 




AV2,tA^ 


-1 


AV, 


2,t- 


(60) 


The inverse of a^I + AV 2 ,tA^ = LtLf can be computed in O {N^) using a Cholesky 


decomposition. Thus, for N < D eq. (60) scales as O (A^D^). Moreover, eq. (45) shows 
that we only require the diagonal elements of Vt in order to update the site approximation 
parameters for f 2 ^t- Hence, we can further reduce the computational complexity by only 
computing the diagonal of V as follows 


diag 

Vt 

= diag 

V2,t 

— diag 



= diag 

V2,t 

— diag 


V2,tA^L-^L-^AV2,t 


V 


2,t 


(1^ {Rt o Rt)) , 


(61) 


where Rt G is defined as Rt = L~^A and 1 is a column vector of ones. The resulting 

computational cost is O (A^D). Similarly, the mean of the global approximate distribution 
of Xt, i.e. mt, can be efficiently evaluated as 


trit = V 2 ,t'nt - V 2 ,tRiRtVz^tVt, 


(62) 
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where rjt = V^^^rhi^t + The total cost of updating the posterior distribution for 

Xt for all t G [T] is therefore O (TN'^D). 


Unfortunately, we cannot get the same speed up for the refinement of the global approxi¬ 
mation of r since the prior covariance matrix Sq in general has full rank. However, we still 
only require the diagonal elements of the approximate covariance matrix S. We implement 


the update as advocated in (Rasmussen and Williams, 2006), i.e. 


S = 


1-1 




1 -1 


= So — S0S3 


Sg^SoSg^ + 7 


-1 




(63) 


where the second equality follows from the matrix inverse lemma. Again, we compute the 
required inverse matrix using the Cholesky decomposition, i.e. the total cost is O (D^T^), 
i.e. cubic w.r.t. both D and T. 


4.7.2 Initialization, Convergence and negative variances 

We initialize all the site terms to be rather uninformative, i.e. = 0, V 2 ,i,t = 10^, 

= 0, = 0, = 10^, = 0 for all i G [D] and t G [T] assuming standard 

scaling of the forward model A. 


There are in general no convergence guarantees for EP and the parallel version in par¬ 
ticular can suffer from convergence problems. The standard procedure to overcome this 
problem is to use “damping” when updating the site parameters 


/ * rl—a £OL 

•'old 


(64) 


where a G [0,1] is the damping parameter and fold is the site approximation at the previous 
iteration. Since both fold and /new belongs to the exponential family the update in eq. (64) 


corresponds to taking a convex combination of the previous and the new natural parameters 
of the site approximation. 


Negative variances occur “naturally” in EP (Bishop, 2006) when updating the site ap¬ 
proximations. However, this can lead to instabilities of the algorithm, non-positive semi- 
definitiveness of the posterior covariance matrices and convergence problems. We therefore 
take measures to prevent negative site variances. One way to circumvent this is to change 


a negative variance to -koo, which corresponds to minimizing the KL divergence in eq. (28) 


with the site variance constrained to be positive (Hernandez-Lobato et al. 2013). In prac¬ 
tice, when encountering a negative variance after updating a given site we use Voo = 10^ 
and (Too = 10® for f 2 A,t and /3 i+, respectively. 


5. Further Approximations 

As mentioned earlier, the updates of the global parameters for xt and P are the dominat¬ 
ing operations scaling as O (TN'^D^ and O respectively. The latter term becomes 
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prohibitive even for moderate sizes of D and T and thus the need for approximations. In 
this section we introduce three simple approximations to reduce the computational com¬ 
plexity of the refinement of the posterior distribution for F. The approximations and their 
computational complexities are summarized in table 


Approximation 


Complexity 

Storage 

Full EP 

(EP) 

o (r^L»^) 

O {T^D‘^) 

Low rank 

(LR) 

O {K^TD) 

O {KTD) 

Common precision 

(CP) 

O {TD"^ -h DT2) 

0{D‘^ + T^) 

Group 

(G) 

O (T=Da 

o KdJ) 


Table 1: Summary of approximation schemes for updating the global parameters for F. 


5.1 The low rank approximation 

The eigenvalue spectrum of many prior covariance structures of interest, i.e. simple neigh¬ 
borhoods etc., decay relatively fast. Therefore, we can approximate Sq with a low rank 
approximation plus a diagonal matrix XIq ~ U+ A, where S G is a diagonal 

matrix containing K largest eigenvalues and U G is a matrix containing the corre¬ 

sponding eigenvectors (Riihimaki et ah, 2014). The diagonal matrix A is chosen such that 
the diagonal in the exact prior covariance matrix 'Sq is preserved. This allows us to apply 
the matrix inversion lemma to compute the update of the posterior covariance matrix for 
F (see section 4.7.1). 


Computing the eigendecomposition of XIq G scales in general as O How¬ 

ever, when the prior covariance has Kronecker structure, the eigendecompositions of Sq = 
(8> Ss can be efficiently obtained from the eigendecompositions of 5]^ G and 

Us G In this case the eigendecomposition of Sq can be obtained in O -|- T^). 


Using a iF-rank approximation, the computational cost of refining the covariance matrix for 
F becomes O (^K^DT^ and the memory footprint is O {TDK). For a fixed value of K this 
scales linearly in both D and T. However, to maintain a sufficiently good approximation K 
can scale with both D and T. 


5.2 The common precision approximation 

Rather than approximating the prior covariance matrix as done in the low rank approxi¬ 
mation, we now approximate the EP approximation scheme itself. If the prior covariance 
matrix for F can be written in terms of Kronecker products we can significantly speed up 
the computation of the posterior covariance matrix of F by approximating the site preci¬ 
sions with a single common parameter. Let 03 G be a vector containing the site 

precisions (inverse variances) for the site approximations for all i G [D] and for all 

t G [r], then we make the following approximation 

^3 ~ (65) 
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where 9 is the mean of value of 63 . Assume the prior covariance matrix for T can be 
decomposed into a temporal part and a spatial part as follows Sq = ® Let Ut, Us 

and St, Ss be eigenbases and eigenvalues for and G respectively. 

The global covariance matrix is updated as S = Sq ^^0 + ^ 3 ^ ^ 3 . We now use the 

properties of eigendecompositions for Kronecker products to compute the inverse matrix 

^5]^ ( 8 ) 5]^ + ^ 3 ^ ~ (5]i ® + T 3 /) 

= [{Ut ® Us) {St ® Ss) {uf ® uj) + S 3 /] 

= {Ut ® Us) {St ®Ss + S 3 L) [uf ® (7j) , (66) 

where {St® Ss + ^^l) is diagonal and therefore fast to invert. The common precision 
approximation 'Sicp is then obtained as 

ScP = (^i ® Ss) (St ® Ss + S 3 /) S 3 / 

= {Ut ® Us) {St ® Ss) {St ®Ss + S 3 /)^^ {uf ® uj) S 3 . (67) 

Let M G denote the diagonal of {St ® Ss) {St ® Sg + S 3 /) then we can compute 

the diagonal of Scp as follows 


diag 



=> diag 



S3 {Ut ® Us),, M, {uf ® uT),^ 

k 

t3Y,iUt®Us)lM, 

k 

S3 {Ut oUt®UsO Us) M 


( 68 ) 


where o is the Hadamard-product. We now see that the desired diagonal can be obtained by 
multiplying a Kronecker product with a vector and this can be computed efficiently using 
the identity 

vec [ABC] = {C^ ® A) vec [B] . (69) 


Therefore, 


diag 


Scp 


S 3 • vec 


{Us o Us) vec-i [M] {Ut o Ut)^ 


(70) 


Since the Hadamard products can be precomputed, this scales as O {D‘^T + T‘^D). During 
the EP iterations we only need to store Us G and Ut G so the resulting 

memory footprint is O {D^ + T^). The posterior mean vector can also be computed efficient 
by iteratively applying the result from eq. ( [ 6 ^ 

^CPV = {Ut ® Us) diag [M] (Df ® U^) rj, (71) 

where rj = Sj^/i 3 + Sg ^/io. 
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The proposed approximation reduces the cost from O to O (^D^T + T^D) if X 

is regularly sampled in time. If the spatial covariance matrix is a Kronecker product itself, 
e.g. 'Sg = Sa; ( 8 ) Xly or Xl* = '^x ® Sy ( 8 ) the computational complexity can be further 
reduced. Such covariance structures could occur in image application or in analysis of fMRI 
data. 

From experiments we have observed that this approach significantly increases the num¬ 
ber of iterations until convergence. However, this problem can be removed by repeating the 
updates for the site approximations and the global approximation for F a few times 
before moving on to update the site approximations for Specifically, within each EP 

iteration we repeat the updates for posterior distribution of F 5 times. The added com¬ 
putational workload is still negligible compared to full EP. Furthermore, for some problem 
instances CP-EP can oscillate around a solutions. This is alleviated heuristically by de¬ 
creasing the damping parameter a by 10 % if the approximate log likelihood decreases from 
one iteration to another after the first 100 iterations. 


5.3 Grouping the latent structure variables 

Consider a problem, where the spatial coordinates d* for each Xi^. are on a fixed grid. 
Assume the characteristic length-scale of the sparsity pattern is large relative to the grid 
size, then support variables {zi} in a neighborhood could “share” the same 7 -variable with 
little loss of accuracy (Jacob et ah, 2009a Hernandez-Lobato et ah, 2013). This grouping 


of the latent variables could either be in the spatial, temporal or both dimensions. Let G 
be the number of groups and g : [D] x [T] —)• [G] be a grouping function that maps from a 
spatial and temporal index to a group index, then the grouped version of the prior is given 
by 


T D 

p{Z\j) = nn Bei , (72) 

i=l i=l 

p( 7 ) = A7(7|/xo,E:o) , (73) 

where po G and Sq G are the prior mean and covariance for the new grouped 

model. The resulting computational complexity is indeed determined by the size of the 
groups. For example, assume that the support variable for a given problem have been 
grouped in groups of 2 in both the spatial dimension and temporal dimension, then the 
total number of groups becomes G = = \DT and the resulting computational cost 

is reduced to a fraction of ^ of the cost of the full EP scheme. Eurthermore, if necessary 
both the low rank and the common precision approximation can be applied on top of this 
approximation. 


6. Numerical Experiments 

In this section we conduct a series of experiments designed to investigate the properties of 
the proposed model and the associated EP inference scheme. 

We describe six experiments with a Gaussian observation model and one experiment with a 


20 









probit observation model. In the first four experiments, we focus on problem instances with 
a single measurement vector. Experiment 1 investigates the effect of the prior by analyzing 
a synthetic data set with a range of different values for the hyperparameters. In the sec¬ 
ond experiment, we compare the three different approximation schemes (low rank, common 
precision, group) to standard EP. Specifically, we analyze a synthetic data set with all four 
methods and discuss the results. Experiment 3 is designed to investigate how the EP algo¬ 
rithms perform as a function of the undersampling ratio N/D giving rise to the so-called 
phase transition curves (Donoho and Tanner, 2010). In this experiment we compare the 


proposed methods to state-of-the-art methods using simulated data. In Experiment 4 we 
apply our model to a binary classification task, where the goal is to discriminate between 
utterances of two different vowels using log-periodograms as features. 

In Experiment 5-7, we turn our attention to problems with multiple measurement vec¬ 
tors. In the fifth experiment, we qualitatively study the properties of the proposed methods 
in the multiple measurement vector setting. We demonstrate the benefits of modeling both 
the spatial and temporal structure of the support and discuss the marginal likelihood ap¬ 
proximation for hyperparameter tuning. Experiment 6 studies the performance of the EP 
algorithms as a function of the undersampling ratio when multiple measurement vectors 
are available and compare the results to competing methods. Einally, in Experiment 7 we 


apply the proposed method to the EEG source localization problem (Baillet et al. 2001). 


These problems are in general very challenging because they are characterized by being 
both extremely ill-posed and extremely ill-conditioned. 


Eor the experiments with synthetic data with use Normalized Mean Square Error (NMSE) 


and F-measure (Rijsbergen, 1979) to quantify the performance of the algorithms. In par¬ 
ticular, we compute the NMSE between the posterior mean of X, i.e. JC = 






and the true solution Xq to quantify the algorithms abilities to reconstruct the true signal 

Xo 


NMSE = **^ ^°**^ (74) 

where || • ||f is the Erobenius norm. We use the E-measure to quantify the algorithms abilities 
to recover the true support set 


^ 2 • precision • recall 

precision -|- recall 

where precision (positive predictive value) is the fraction of non-zero weights found by the 
algorithm that are also non-zero in the true model, while recall sensitivity is the fraction of 
non-zeros in the true model that have been identified by the algorithm. Here a given weight 
Xi^t is identihed as being non-zero if the posterior mean of Zi^t is above 0.5. 

Our code is available at http://niichaelriisandersen.com/projects. 
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6.1 Experiment 1: The effect of the prior 

In this experiment we investigate the effect of the structured spike and slab prior on the 
reconstructed support set. For simplicity we only consider the spatial structure, i.e. T = 1, 
and we further assume that the spatial coordinates are on a regular ID grid. We construct 
a sparse ID test signal xq G where the active coefficients are sampled from a cosine 

function, see Figure [Tj Based on this signal we generate a synthetic data set using the a 
linear model y = Axq + e, where Aij ~ AA(0, 1), e is isotropic Gaussian noise and the 
number of samples is = 0.5D. The prior on 7 is of the form ^( 7 ) = AA( 7 |p, E), where 

/ \ 

H = ly ■ 1 and Ejj = Kexp ( I • We sample the length-scale R equidistant 100 times 

in the interval [l0“^, 50] and run the algorithm on the synthetic data set for each value of 
R. For this experiment we use the standard EP scheme, i.e. no further approximations. 
The noise variance is fixed to the “true value”, i.e. the values used to generate the data 
set and the remaining hyperparameters are fixed ly = 0, t = 1, k = 5. The posterior results 



Figure 1: Synthetic test signal xq. The active coefficients are sampled from a cosine function 


are shown in the panels in leftmost column in Figure The topmost panel shows the 
marginal likelihood approximation as a function of the spatial length scale R. The panel 
in the middle shows the posterior mean of 7 *, i.e. [ 7 ^], as a function of the scale R. 

That is, each column in the image correspond to the posterior mean of 7 for a specific value 
of R. The panel in the bottom shows a similar plot for the posterior support probabilities, 
i-e- ^Q{z,\v) [^i]- 

When R is close to zero the posterior mean vectors for both 7 and z are very irregular 
and resemble the solution of an independent spike and slab prior. As the length-scale in¬ 
creases the posterior mean vector 7 becomes more and more smooth and eventually give 
rise to well-defined clusters in the support. The algorithm recovers to correct support for 
R G [3,15]. However, at i? ss 15 a discontinuity is seen. Since the prior distribution on z 
does not exhibit any phase transitions w.r.t. this is likely to be an effect of a unimodal 
approximation to a highly multimodal distribution. The discontinuity is also present in 
the marginal likelihood approximation as seen in the top panel and therefore one should 
be cautious when optimizing the marginal likelihood using line search based methods. We 
repeated this experiment for multiple realizations of the noise and the discontinuity was 
only present in few of the realizations. 
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The rightmost column in Figure show equivalent Figures for a sweep over prior mean 
of 7 i, i.e. V, where it is seen that the algorithm recovers the correct support for v G [—15, 0]. 
It is seen that when v is below some threshold r'lower; the posterior mean of Zi is close to 
zero for all i G [D]. The total number of active variables increases with v until v surpasses 
an upper threshold t'upper, where all variables are included in the support set. It is also 
seen that variables are included cluster-wise rather than individually, which gives rise to 
discontinuities in the marginal likelihood in the topmost panel. 
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Spatial length-scale Prior mean for 7 

(a) Log likelihood (b) Log likelihood 
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(e) Posterior mean of 2; (f) Posterior mean of a 


Figure 2: The effect of the spatial prior distribution. The left-most column shows the 
approximate marginal log likelihood, posterior mean of 7 and posterior mean 
of z as a function of the prior length-scale of 7 . The right-most column shows 
similar plots as a function of the prior mean z^o of 7 . 
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6.2 Experiment 2: Comparison of approximation schemes 

In this experiment we investigate the properties of the proposed algorithm and the three 
approximation schemes: standard EP (EP), the low rank approximation (LR-EP), the 
common precision approximation (CP-EP) and the group approximation (G-EP). Using 
a similarly setup as in Experiment 1 we generate a sample of 70 ; -^o and xq from the 
prior distribution specified in eq. ([^-Q with a squared exponential kernel and parameters 
R = 75, n = 100 and r = 1. The specific realization is shown in the leftmost panels in 
Figure We generate observations from a linear measurement model y = Axq + e, where 
Aij ~ J\f (0,1) and the noise variance a'^ is chosen such that the signal-to-noise is 20dB. 
We now seek to recover xq, zq and 70 from the observed measurements y using standard 
EP and the three approximation schemes. For the low rank approximation we included 7 
eigenvectors corresponding to 99% percent of the variance and for G-EP we used a group 
size of 10 variables. Columns 2-5 in Figure show the posterior mean values for x,z, and 
7 for EP, LR-EP, CP-EP and G-EP, respectively. 



(a) True 70 (b) jep (c) jlr (d) jcp (e) 70 



0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500 


(f) zo 


(g) ZEP 


(h) zlr 


(i) zcp 


(j) ZG 


0 100 200 300 400 




(k) True xo (1) xep 


(m) xlr 


(n) xcp 


(o) XQ 


Figure 3: Comparison of approximation schemes. The panels in the first column shows a 
realization x,z, and 7 from the prior distribution in eq. @- 0 . The columns 2-5 
show the posterior mean quantities for EP, LR-EP, CP-EP and G-EP, respectively. 
The pink shaded areas depict ± std. deviation. 


Consider first the posterior mean and std. deviation for 7 for standard EP (topmost row, 
second column). In the region where 70 is positive the posterior mean accurately recovers 
7 o with high precision, whereas both the accuracy and the precision is lower in regions 
where 70 is negative. The reason for the additional uncertainty is that negative values of 
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7 j are in general associated with small value of |xj|, but |xj| can be small for two reason. 
Recall that each Xi can be considered as a product Xi = Zi- Ci, where Zi £ {0,1} and c* G M. 
If Zi = 0, then clearly Xi = 0, but we can still have that x, « 0 even if Zi = 1 and Ci ^ 0 
and thus the increased uncertainty. 

We can now compare the posterior distribution of 'ft for standard EP with the three approx¬ 
imations. Based on visual inspection one cannot tell the difference between the standard 
EP and EP with the low rank approximation, but the results for CP-EP and G-EP are quite 
different. Eor CP-EP it is seen that the posterior mean in the positive region is accurate, 
but the CP-EP approximation underestimates the uncertainty in general. The grouping 
effect for G-EP is clearly seen in the topmost panel in the last column, but despite the 
staircase pattern the posterior mean and variance are accurately recovered. The second 
and the third for in Eigure show the reconstructions of x and 2 . We see that all of the 
four approaches accurately reconstruct the true quantities despite the approximation of the 
posterior distribution of 7 . 


6.3 Experiment 3: Phase transitions for a single measurement vector 


It is well-known that the quality of the inferred solutions strongly depend on both the 
undersampling ratio 6 = N/D and the sparsity level K = ||a;||o- In fact, linear inverse 
problems have been shown to exhibit a phase transition from almost perfect recovery to 


no recovery of solution x in the (d, R)-space (Donoho and Tanner, 2010; Donoho et al. 


2011). In this experiment we investigate how the EP algorithms perform as a function of 


the undersampling ratio N/D. As in Experiment 2 we generate 100 realizations of xq from 
the prior such that ||a;||o is fixed to K = = 125 and D = 500. We condition the samples 

on ||a;||o to reduce the variance of the resulting curves for NMSE and E-measure because 
the recovery performance in general depends on the number of non-zero coefficients. Based 
on these realizations we generate measurements y £ through the linear observation 
y = Axq + e for a range of values for N. The forward model A is a Gaussian i.i.d. en¬ 
semble, where the column have been scaled to unit 1 ' 2 -norm. The noise e W( 0 ,Ct 2 ) ig 
zero-mean Gaussian noise, where the noise variance is chosen such that the signal-to- 
noise (SNR) ratio is fixed to 20dB. We choose values of N such that ^ G [0.05,0.10, ...,0.95]. 


We compare our methods with Bernoulli-Gaussian Approximate Message Passing (BG- 
AMP) (Vila and Schniter, 2013), Orthogonal Matching Pursuit (OMP) ( Needell and Trop^ 
2010 ) and an “oracle” estimator, which computes a ridge regression estimate based on 
knowledge of the true support. The regularization parameters for ridge regression is fixed 
to A = 10“^ for all runs. Einally, for comparison we also apply the proposed EP algorithm 
with a diagonal prior covariance matrix, which correspond to the conventional independent 
spike and slab prior (lEP). We provide BG-AMP and OMP with prior knowledge of the true 
number of non-zero variables in xq and the noise variance used to generate the observations. 
The results are shown in Eigure]^ 

The two black curves in Eigure show the results for BG-AMP (black, solid) and EP 
with diagonal prior covariance (black, dashed), i.e. both methods are based on conven- 
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Iterations 



(a) NMSE (b) F-measure 



Figure 4: Performance of the methods as a function of undersampling ratio N/D for T = 1. 
The results are average over 100 realization. 
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tional independent spike and slab priors. It is seen that the methods with prior correlation, 
i.e. EP (blue), CP-EP (cyan), &: LR-EP (red, dashed), are uniformly better than the meth¬ 
ods with independent priors both in terms of NMSE and E-measure. In fact, these methods 
achieve the performance of the support-aware oracle estimator around N/D = 0.6 in terms 
of NMSE. Eurthermore, it is also seen that the two approximations CP-EP and LR-EP are 
indistinguishable from the full EP algorithm in terms of accuracy. Panel (c) and (d) show 
the number iterations and the run time per iteration for the EP-based methods. Here it is 
seen that lEP has the lowest computational complexity per iteration, but the CP-EP and 
LR-EP are almost as fast. 


6.4 Experiment 4: Phoneme recognition 



Frequency bin 



Frequency bin 


(a) The two classes 


(b) Difference in mean 


Eigure 5: a) The frequency-wise mean and standard deviation of the log-periodogram of 
the two spoken vowels ”aa” and ”ao”. b) The difference of the two mean signals. 


In this experiment we consider the task of phoneme recognition (Hastie et ah, 2001 


Hernandez-Lobato et al. 


2011). In particular, we consider the problem of discriminating 
and ”ao” based on their log-periodograms. The data set 
aa” and ” ao”, respectively, along with 


between the spoken vowels ”aa" ana "ao 
consists of 695 and 1022 utterances of the vowels 
their corresponding labels. The response variable in this experiment is binary and therefore 
we use a probit model rather than a Gaussian observation model. 

Each log-periodogram has been sampled at 256 uniformly spaced frequencies. The left¬ 
most panel in Eigure shows the frequency-wise mean and standard deviation of the two 
classes and the right-most panel shows the difference of the two mean signals. We choose a 
squared exponential kernel for 7 since it is assumed that frequency bands rather than single 
frequencies are relevant for discriminating between the two classes. The total number of 
observations is 1717 and we use N = 150 examples for training and the remaining 1567 
examples for testing. We repeat the experiment 100 times using different partition into 
training and test sets. The training and test sets are generated such that the prior odds of 
the two classes are the same for both training and test. The number of input features is 
D = 257 (256 frequencies -|- bias). 
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Method 

Mean test error 

NBSBC 

19.544 

(± 0 . 101 ) 

EP 

19.472 

(±0.105) 

LR99-EP 

19.455 

(± 0 . 102 ) 

LR95-EP 

19.482 

(±0.103) 

CP-EP 

19.502 

(±0.103) 

G 8 -EP 

19.411 

(±0.103) 

G16-EP 

19.436 

(±0.104) 


Table 2: Results for phoneme classification experiment 


In this experiment both N and D are relatively small and therefore we can run the full 
expectation propagation algorithm without any approximations in reasonable time. How¬ 
ever, we run full EP and the three proposed approximations on this data set to evaluate 
the accuracy of the proposed approximation schemes. For the low rank approximation we 
choose the number of eigenvectors such that 99% and 95% of the total variance in Sq are 
explained, respectively. For the group approximation we choose group sizes of 8 and 16, 
respectively. For all the methods, we initialize the length-scale to R = 50 and the magni¬ 
tude K = 100 and optimize these using simple gradient descent based optimization of the 
marginal likelihood approximation from EP. We choose the prior mean of 7 to = 0 to 
reflect our ignorance on the number of active weights. We have observed that optimizing 
the prior mean and variance of the ’’slap component” using the marginal likelihood approx¬ 
imation gives sub-optimal results. Instead we fix po = 0, tq = 10“^ for all methods and all 
cross validation splits. 


We compare our methods against the use the NBSBC method, which imposes smooth¬ 
ness on the support using a Markov Random Field and has been shown to outperform 
competing method on this problems ( ]Hernandez-Lobato et ah 2011). Table summarizes 
the performance of the methods based on number of misclassification on the test set. First 
of all, we see that the proposed EP algorithm achieves similar performance in terms of mean 
test error as state of the art methods. Furthermore, we see that applying the approximation 
schemes does not degrade performance significantly. 


In Figure we compare the posterior quantities for the full EP algorithm and the three 
approximations. The leftmost column shows the results for full EP and the subsequent 
columns shows the results for LR-EP, CP-EP and G-EP, respectively. Visual inspection of 
the Figure shows that except for the stair-case pattern of the group approximation, the full 
EP posterior distribution over 7 is well approximated by all three methods and the similar 
for the posterior distribution on x. 
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Frequency bin 



Frequency bin 




Frequency bin Frequency bin 


(a) 7 ep 


(b) 7iP99 


(c) 7CP 


(d) 7G8 



(e) zep 


(f) ZLR99 


is) ZCP 


(h) ZG8 



(i) xep 


(j) XlR99 


(k) XCP 


(1) XG8 


Figure 6: Comparison of the posterior quantities for full EP and the three approximations. 


6.5 Experiment 5: Spatio-temporal example 

In the previous experiments we have focus on problems with only one measurement vector, 
whereas in this and the following experiments we consider problems with multiple mea¬ 
surement vectors. Specifically, in this experiment we qualitatively study the properties 
of the proposed algorithm in the spatio-temporal setting using simulated data. We have 
synthesized a signal, where the support set satisfies the following three properties: 1) non- 
stationarity, 2) spatiotemporal correlation & 3) the number of active coefficients change 
over time. The support of the signal is shown in panel (a) in figure]^ Based on the support 
set we sample the active coefficients from an zero-mean isotropic Gaussian distribution. 
We then observe the desired signal X through linear measurements Y = AX + E, where 
both the forward model A and the noise E is sampled from zero-mean isotropic Gaussian 
distribution. The noise variance is scaled such that the SNR is 5dB. We apply our proposed 
method to estimate X given the forward model A and the observations Y. 

Panel (b) in figure shows the reconstructed support Z using the proposed EP algorithm 
with a diagonal prior covariance matrix on P, i.e. with no prior correlation in the support. 
The panels (c)-(f) shows the reconstructed support for full EP, low rank EP, common pre- 


30 






























cision EP and group EP, which all assumes that the prior covariance matrix for P is a 
Kronecker product of two squared exponential components. For the group approximation 
the group size is chosen to 5 &: 10 in the spatial and temporal dimension, respectively, 
and for the low-rank approximation the rank is chosen such that the minimum number of 
eigenvectors explain 99% of the variance in the prior. The hyperparameters are chosen by 


optimizing the approximate marginal likelihood using a simple ADAGRAD (Duchi et al. 


2011) scheme. By inspecting the panels (a)-(f) it is seen that the reconstructed support is 


qualitatively improved by modeling the additional structure of the support. Furthermore, 
the reconstructions using the approximation schemes do not differ significantly from the 
result using full EP. 


Panels (g)-(j) shows the marginal likelihood approximation as a function of the spatial 
and temporal length scale of the prior covariance matrix for the proposed methods, while 
the panels (k)-(g) show the corresponding NMSE between the reconstructed coefficients 
X and the true coefficient X. The black curves superimposed on the marginal likelihood 
plots show the trajectories of the optimization path for the length-scales of the prior dis¬ 
tribution starting from four different initial values. It is seen that the marginal likelihood 
approximation is unimodal and correlates strongly with the NMSE surface, which suggests 
that it is reasonable to tune the length-scales of the prior covariance using the marginal 
likelihood approximation. However, we emphasize that this is not always the case and for 
some problems this indeed leads to suboptimal results. 
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Figure 7: Reconstruction using the spatiotemporal model. D = 100, T = 30, N = 33 and 
SNR = 5dB. 
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6.6 Experiment 6: Phase transitions for multiple measurement vectors 


The multiple measurement vector problem also exhibits a phase transition analogously to 


the single measurement vector problem described in Experiment 3 (Cotter et al., 2005 Ziniel 


and Schniter 

2013a 

Andersen et al. 

2015 


2015). In this experiment we investigate how location 


of the phase transition of the EP algorithms improves when the sparsity pattern of the 
underlying signal is smooth in both space and time and multiple measurement vectors are 
available. Using a similar setup as in Experiment 3, we generate 100 realizations of X from 
the prior specified in eq. ([T^)- (1T§ such that the total number of active components is hxed 
to K = \DT = 2500. The covariance structure is of the form Eq = bitemporal ® blspatiaU 
where both the temporal and spatial components are chosen to be squared exponential 
kernels. Figure shows an example of a sample realization of F, Z and X from the prior 
distribution. For each of the realizations of X, we generate a set of linear observations 
Y = AX + E, where the forward model A is Gaussian i.i.d. and Ent ~ AA (O, cr^) is zero- 
mean Gaussian scaled such that the SNR is hxed to 20dB. For reference we compare our 


methods against BG-AMP (Vila and Schniter, 2013) and DCS-AMP (Ziniel and Schniter 


2013a). The DCS-AMP method uses approximate message passing inference based on spike 


and slab priors, but assumes that the binary support variables evolve in time according to 
a hrst order Markov process. Both methods, i.e. BG-AMP and DCS-AMP, are informed 
about the true number of active coefficients and the true noise level. The results are shown 
in hgure 



Temporal dimension Temporal dimension Temporal dimension 

(a) Sample of F (b) Sample of Z (c) Sample of X 

Figure 8: Example of a realization of the synthetic signals in Experiment 6. 

The method LR-EP (blue) assumes that the sparsity pattern is spatially correlated, but 
independent in time. The method LR-K-EP (red, dashed) applies the low rank approxima¬ 
tion to EP and assumes that the sparsity pattern is spatio-temporally correlated and that 
the prior covariance for F is described by a Kronecker product (hence the prefix “-K”). Sim¬ 
ilarly, the methods CP-K-EP (cyan) and G-K-EP (magenta) have the same assumptions 
about the sparsity pattern, but use the common precision approximation and the group 
approximation, respectively. For G-K-EP we use groups of 5 in both the spatial dimension 
and temporal dimension. In this experiment we do not run full EP with the spatiotemporal 
prior because it would be prohibitively slow. 

On panel (a) in figure it is seen that as the number of measurements increase, all 
methods eventually reach the NMSE level of the support-aware oracle estimator, but the 
general picture is that the more structure a method takes into account, the better it performs 
in terms of NMSE. In particular, at N/D « 0.3 BG-AMP achieves NMSE « 0.63 and LR- 
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(a) NMSE 




Undersampling ratio N/D 


(c) Iterations 



(d) Time per iterations 


Figure 9: 


Performance of the methods as a function of undersampling ratio N/D for T 
100. The results are average over 100 realization. 
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EP achieves NMSE « 0.44 while LR-K-EP and CP-K-EP achieve NMSE 0.24. Panel 
(b) shows a similar picture for F-measure. Furthermore, it is seen that the performance 
of LR-K-EP and CP-K-EP are similar and slightly better than the performance of G-K- 
EP both in terms of NMSE and F-measure. However, the G-K-EP approximation has the 
lowest computational complexity per iteration as seen in panel (d). In terms of run time the 
EP-methods are slower compared to the AMP-based methods, which have linear complexity 
in all dimensions. However, the EP methods are not limited to Gaussian i.i.d. ensembles 
as the AMP-based methods are. 


6.7 Experiment 7: EEG Source Localization 


In the final experiment, we apply the proposed method to EEG source localization (Baillet 


et al., 2001), where the objective is to infer the locations of the active sources on the cortical 


surface of the human brain based on electroencephalogram (EEG) measurements. In this 
application we model the brain using a discrete set of current dipoles distributed on the 
cortical surface and Maxwell’s equations then describe how the magnitudes of the dipole 
sources relate to the EEG signals measured at the scalp. We apply the proposed method 
to a benchmark EEG data set, where the subjects are presented with pictures of faces and 
scrambled faces. The data set is publicly available and the experimental paradigm is de¬ 


scribed in (Henson et ah, 2003). The data set has N = 128 electrodes and contains a total 


of 304 epochs with roughly 150 epochs of each of the two conditions, i.e. face or scrambled 
face. Each epoch has a duration of roughly 800ms corresponding to T = 161 samples in 
time. The stimuli is presented after 200ms. 


We use SPM8 http://www.fil.ion.ucl.ac.uk/spni/software/spni8/ to generate a for¬ 
ward model with 5124 dipole sources, i.e. A G ]^i28x5i24^ Again we choose the covariance 
matrix for T to be of the form Eq = k • Etemporai <8* ^Ispatiah where both the temporal and 
spatial component are squared exponential kernels with two separate length-scales. For 
simplicity, we use the Euclidean distance to compute the pairwise distances of the dipoles 
as opposed to the more advances approach, where the distances are computed within the 
manifold defined by the cortex. 


The resulting inverse problem has N = 128 measurements, T = 161 measurements vec¬ 
tors and D = 5124 unknowns per time point, i.e. a total of DT = 5124 • 151 = 824964 
unknowns. The forward model has a condition number of cond (A) = 3.1099 • 10^^. Thus, 
the problem instance is both heavily ill-posed and ill-conditioned. Because of the dimen¬ 
sions of this problem we only use the common precision approximation for this data set. In 
fact, a low rank approximation of the prior covariance matrix Eq will require 3961 eigen¬ 
vectors to explain 90% of the variance. The matrix U G ]^824964x3691 q£ eigenvectors would 
then require more than 20GB of memory to store in 64 bit double precision. Tuning the 
hyperparameters using the approximate marginal likelihood approximation leads to severe 
overfitting for this data set. Instead we chose the spatial length-scale to 10mm, the tem¬ 
poral length-scale to 50ms, the magnitude to k based on a crude cross validation scheme. 
The prior mean of T is chosen to t'o = 0 to reflect our ignorance of the number of active 
sources. Figure 10 shows the maximum a posteriori number of active sources as a function 
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time [s] 


Figure 10: The number of maximum a posteriori number of active sources as a function of 
time, i.e. ^ • Pi^t > 0.5 as a function of t. 


of time, i.e. we plot ^ • Pi ,t > 0.5. It is seen that the number of active sources is zero until 
roughly time t ~ 150ms, where the number of active sources increase and peaks t ~ 195ms. 



(a) (b) (c) (d) 

Figure 11: Posterior mean values for -ft at time t = 195ms. 


Figure 11 13 shows the posterior mean of -ft, Zt and mt, respectively, at 195ms, where 
the number of active sources peaked. The reconstructed activation probabilities are well- 
defined and nicely clustered as expected. Finally, in figure we investigate the time 
evolution of the two active sources marked by the arrows in figure The two signals peak 
at t = 165ms and t = 175ms, respectively, which is roughly consistent with the critical time 
scale of 170ms for face perception. Interestingly, we detect the four areas: left and right 


occipital and fusiform face areas that are associated with face perception (Henson et al. 


2009), see Figure 12 
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(a) (b) (c) (d) 

Figure 12: Posterior mean values for pt at time t = 195ms. 



Figure 13: Posterior mean values for nit at time t = 195ms. 



Figure 14: The time evolution of the two active sources marked in figure [TT| with arrows. 
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7. Summary and Outlook 

In this work we have addressed the problem of solving multiple underdetermined linear 
inverse problems subject to a sparsity constraint. We have proposed a new generalization 
of the spike and slab prior distribution to encode a priori correlation of the support of the 
solution in both space and time by imposing a transformed Gaussian process on the spike 
and slab probabilities. An expectation propagation (EP) algorithm for posterior inference 
under the proposed model has been derived. Computations involved in EP updates scale 
like O where D is the number of features and T is number of inverse problems, 

hence for large scale problems, the standard EP algorithm can be prohibitively slow. We 
therefore introduced three different approximation schemes for the covariance structure to 
reduce the computational complexity. First, assuming that the prior has a Kronecker de¬ 
composition brings complexity O , based on this decomposition, a further K-rank 

approximation brings a reduction of complexity to O we also proposed a common 

precision approximation of complexity O (^D‘^T + and finally a scheme based on 

spatio-temporal grouping of variables effectively reducing D and T by the grouping factor. 

We investigated the role of the spatio-temporal prior and the approximation schemes in 
a series of experiments. First we studied a simple ID problem with spatial, translational 
invariant smoothness of the support (single measurement case, T = 1) . For a signal with 
two small connected components in the support, we illustrate the solutions for variable 
smoothness of the prior. For a wide range of prior parameters the correct form of the sup¬ 
port is recovered, while the two support regions were found to merge in a single region as 
the smoothness length scale approaches the distance between the two regions. In a second 
experiment we investigated the role of the three covariance function approximations, also 
in a single measurement setup (T = 1). We found that all approaches accurately recon¬ 
struct the true simulated support and inverse problem solutions despite the approximation 
of the Gaussian process posterior. It is well-known that the quality of the inferred solutions 
strongly depends on both the undersampling ratio and the sparsity level of the true solu¬ 
tion. We investigated how the location of the phase transition is improved by invoking the 
smoothness prior. We found that the methods based on assumed prior correlation, were 
uniformly better than the methods with independent priors both in terms of the quality 
of normalized mean square error and in terms of their accuracy of support recovery (F- 
measure). The covariance approximation schemes are almost as fast as the scheme without 
smoothness, while yielding greatly improved performance. 

Next we investigated the benchmark phoneme recognition data set, which allowed us to 
compare with a number of existing schemes for the single measurement case (T = 1). In 
this experiment both N and D were relatively small and therefore we could run the full 
EP scheme as well. Full and approximate EP were found to compare well with published 
benchmarks. 

In a fifth experiment we studied the properties of the proposed algorithms in the spatio- 
temporal setting using simulated data. Signals were synthesized so that the support set 
showed non-stationarity, spatio-temporal correlation and so that the cardinality of the sup- 
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port set changed over time. We estimated prior hyperparameters by optimizing the approx¬ 
imate marginal likelihood and found they converged to optimal settings in all cases. We 
found that there was a good correspondence between the approximate marginal likelihood 
and the solution’s quantitative performance measure (NMSE). 

Also for the multiple measurement vector problem it is known that there is a phase transition¬ 
like dependence of the solution quality on undersampling ratio. In the sixth experiment we 
investigated how the location of the phase transition of the EP algorithms improved when 
the sparsity pattern of the underlying signal is smooth in both space and time for the mul¬ 
tiple measurements case. We compare our various approximate solvers with state-of-the-art 
tools based on approximate message passing: BG-AMP, DCS-AMP, both of which were 
informed about the true number of active coefficients and the true noise level. The full EP 
was too demanding to run for this problem. Significant improvement were found for the 
methods that exploited sparsity structure. Comparing performance with AMP methods, 
the EP methods performed best both in terms of identifying the support (E measure) and 
in terms of NMSE. Run times for the EP-methods were longer compared to the AMP-based 
methods, which have linear complexity in all dimensions. We noted importantly that the 
EP methods also can be used for more general forward model ensembles (A), while the 
AMP-based methods assume a Gaussian i.i.d. ensemble. 


In the final experiment, we applied the proposed methods to the hard problem of EEC 
source localization; data for this experiment was derived from a publicly available brain 


imaging data set designed to detect brain areas involved in face perception (Henson et al, 


2003). This was a larger scale application with N = 128 measurements and a total number 


of 824964 unknowns {DT = 5124 x 151), hence, only the common precision approximation 
was feasible. Eurthermore, the forward model was very ill-conditioned in contrast to the 
well-conditioned i.i.d. ensembles considered in the simulations. In spite of these challenges 
highly interesting results were obtained: All four main foci of activation as earlier detected 
by fMRI, but not in these EEG data by other inference schemes, were here found to have 
well-defined and spatially extended support by the new approximate inference scheme. In 
contrast to fMRI EEG allowed us to monitor the dynamics in these areas in high temporal 
resolution. 


Euture studies include an analysis of the phase transitions of the approximate marginal 
log likelihood in the hyperparameter space of the spatiotemporal prior as discussed in Ex¬ 
periment 1. Eurthermore, we plan to apply the proposed algorithms to other data sets, such 
as classification of fMRI task pattern data sets. Einally, we also plan to investigate the use 
of spatio-temporal sparsity priors for factor models, i.e. PGA and IGA. 
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Appendix A. Moments computations for f 2 ^t,j 

In this section we consider the update for the terms First we compute the 

so-called cavity distribution ztj) by removing the contribution of 

from the marginals of the joint approximation Q (x, z,^) 


f2,t,j ixtj,ztj) J\f [xtj\m2,t,j, Ber {zt,j\(l) {^2,tj)) 

= Ber (7^^’*’^')) , (76) 


where 


r,\2,tj _ 




1 -1 






- V2ljm2,t,j 


(77) 

(78) 

(79) 


Note that the cavity parameter for 7 for /2,t,j is simply equal to 73,t,j (and vice versa) since 
72 ,and 73,t,j are the only two terms contributing to 'jtj- 

Next, we minimize the KL-divergence between f 2 ,t,jQ^‘^’^'^ and q or equivalently match¬ 
ing the moments between the two distributions. Following the latter approach we first 
compute the (unnormalized) moment w.r.t. Zt^j 


2. - E /^.,iA aj {^.j. ‘ij) Q''"--'-’ dx 

= A7 - po, H\2>7i . 


(80) 


Next, the zeroth moment w.r.t Xt^i or the normalization constant of f 2 ,t,jQ'^‘^’^'^ 


Aq — ^ ^ f2,t,j {xtj, Ztj) (^i,i> 2^4,i) (^f) 

= ^ y [(1 - Ztj) 6{xtj) + ztjM {xt,j\po, 'To)] A7 (xt,j\m\‘^’^’^, H\2,ij^ Ber (^ztj\(j) 

= — (j) M + (j) J\f — pO) + tqJ 

= (l-(l) (7^^’*’^')) A7 (82) 
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We now compute the (unnormalized) first moment w.r.t. Xtj 


■^1 - E 2.,i) Q'"-’ (nj,ztj)dx,j 




= 0 (7^2’*’^) M (o|m\2Ai _ v\2,t, + 

_|_ pQV\‘2>id 


— -I- „ 

TO ' y\2,t,j 


= ^ . 

"^0 + 

and the second (unnormalized) moment w.r.t. Xtj 

^2 = ^ y xjf 2 ,i (Xi, Zi) (Xj, Zi) dXi 

= (j) A/" - po, + ro^ 


+ ^ \ 1 

y\2.i TO I _|_ J. 


A + ^ A I ^ 

TO y\2,i / To y\2.i 


= Zi 


7^\2,ijTo + \ _|_ 


To + y\ 2 .ij 


y\2,i + ro 


The central moments for Q* in eq. (28) are given by 


E[xi.,\^^. 


VI 1 ^2 


A' j] — Y ■ 


(83) 


(84) 


(85) 


Appendix B. Moment compntations for /s t j 

The moments matching for /s^tj is derived in a similar manner as for (see appendix [a| 
for details). First we compute the so-called cavity distribution [ztj,'yt,j) by removing 

the contribution of f 3 tj{ztj,jt,j) from the marginals of the joint approximation Q 


/Sjtj {ztj, 'ytj') Ber (ztj|(/) (73,47)) A/" p, 3 ,t,jt 
= . Ber (^ztj\(l) (7^^’*’^')) A (74,,'^ (§ 6 ) 


where 


= I - E^lj 


-1 


psAi = E''=>'''2' - i;3-‘73jj), 




(87) 

( 88 ) 

(89) 
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Once again we minimize the KL-divergence between and Q or equivalently 

matching the moments between the two distributions. We now compute the moments 
w.r.t. and Zj^t of the (unnormalized) tilde distribution 


^^ = 12 j ■ /s,*J {Zj,u 7j,t) {Zj,t, 7j,t) d7j-1 

■^1 ~ ^22 j {zj,t) lj,t) {zj,t, lj,t) d7j,t 


for m = 0,1, 2, 


(90) 

(91) 


We first compute the normalization constant of 


^0 — ^2 J {zt,j,'yt,j) Q'^^’ {ztJj'Ji) d'ftj 

= ^ / Ber {zt,j\(j){jt,j)) Ber (^ztj\(j) (7^^’*’^)) W (7 *^-d 7 tj 

= '22 j (1 “ 2 :*) (1 - (/> ( 7 j)) ( 7 ^^’*)) + (7i) 0 ( 7 ^^’*) ^ d 7 i 

</>(7\'’*)) I d7, + <))(7\3,.^ y())(7,)AA(7,|/l\3.*,s\3.*^ d7. 


Zi 

= (l-' 


and Williams, 2006), 


Integrals of the form f (^(7j)AA (7j|/l\^’*, d7i can be solved analytically (Rasmussen 


(j) (7i) AA ( 7 * 1 / 1 ^^’*, d7i = (j) (c3,i), C3,i = ^ ^ 

^ ^ V1 + S\3.' 


(92) 


Plugging this result back into the expression for Go yields 


Gq= (^7^^’*) )(!-<(' (c3,i)) + 4> (7^^’*) 4> {C3,i) ■ 


(93) 


We can now compute the moments of the unnormalized distribution 


^1 = '22 J (Zi, 7i) izi,7i) d7i 




(94) 


Then the first moment w.r.t. to Zi^t is obtained as E \zi^t\ = ZiIGq. 
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For the moments w.r.t. 7 *, we get 
Gi = ^ / iih,i 

Zi 

= X] /">'* (1 “ ^*) (1 “ (7i)) ( 7 ^^’*)) + (7i) </> ( 7 ^^’*) A/" (^7i|A^^’\ d7i 

= (1 - (/> ( 7 ^^’*)) /7i (1 - </>(7i))AA (^7i|/i\^’\s\^’*) d 7 i 


+ </' ( 7 ^^’*) J li4> in) d7i 

1 - (j) ( 7 ^^’*)) A^^’* - J li4> (70-^ (7i|A^^’A d7i 
<\> ( 7 ^^’*) j 'yi<l> in) d7i 


(95) 


Again we turn to ( [Rasmussen and Willianis , 2006) for the analytical solution of the above 
integrals 


7i</> ( 7 *)-^ f7i|A^^’A d7i = 4> (C3,i) + (k (C3,i) 


S\3,W(c3,i|0,l) 


(()(c3,j) \/l + S\3-* 

= </> (C3,i) A^^’* + <('(C3,i) <^3,0 (96) 


where we have defined 


d3,i = 


S\3.W (C3,i|0,l) 
(/>(c3,i) \/l + S\3,i 


(97) 


Plugging eq. (96) back into eq. rt95^ and simplifying yields 


Gi = (1 - 0 (7^^’*) ) (1 - (/>(c3,i)) A^"^’* - <(>(c3,A^^3,i +4>{l'''^’'')(t^{c3,i) A^'*’*+d3,i 


'A3:* 


;A3,i 


'A3,* 


= (1-0(7^^’*)) (1 - <(>(c3.i)) A''^’* - (1 - <(> (7^^’*) ) 9^(c3,*)c^3,i + A'''^’* + d3,i 


'A3,* 


A3,* 


'A3,* 


= (Go - Zi) - (1 - A ( 7 ^"’')) <t> {C3,i) d3,i + Zi [A\3,i ^ . 

= GoA^^’* + {‘^Zi — A (C3,j)) ^3,1 


(98) 


Thus, the first moment w.r.t. 'ji^t is given by E [ 7 *^ 4 ] = Gl/GO. 
Similarly, we compute the second moment w.r.t. 7 * 


G 2 = J 2 [ ^*^,^3,i {Zh li) {Zi,li) d7i 

= ( 1 -^( 7 ^^’*)) J 7,'(1-0(7A)AA(7.|A\'AS\3,*) d7* 

+ A (7^^’*) j in) AA (7*1 A^^’\ d7i 


(99) 
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The solution to the above integrals are given by (Rasmussen and Williams 2006) 


= 4> (C3,i) 




where 


63 ,* = 


(E\3.*^'c3,iAA (C3,,|0,l) 

</>(c3,i) (1 + S\3,i^ 


Furthermore, we define 


Plugging the above result back into eq. (IMl) and rearranging yields 



- f) {C3^i) Ws^i 

( 1 - 0 ( 7 ^^’*)) 

f (C3,j) W3^i 


(100) 


( 101 ) 


u;3,i ^ 2/1\3.* (^/i\3,* + ^3 . (102) 


+ 4> ( 7 ^^’*) 4> (C3,i) W3,i 

+ ZiW3^i 


(103) 


Thus, the second moment is given by E 
Q* then becomes 



G2/G0. Finally, the central moments of 


bj,t] 


Gi 

Go’ 


V[7i.i] 


G2_^ 
Go Gg’ 


E [zj^t] 


Go' 


(104) 


These moments completely determine the distribution and thus, we compute the 

updates for /s^j as follows 


vbr'EM- 

73T = 4</'(EN)>7\"" 


y^new _ 

^3,i — 


p new _ Y^new 

P3,i — ^3,j 


-1 


• \ -1 




\ 3 ,i 


(105) 

(106) 
(107) 
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